Grid Checkpointing Architecture - a revised proposal
Abstract
Contemporary Grid environments are characterized by increasing virtualization and distribution of resources. This places greater demands on load-balancing and fault-tolerance capabilities. The checkpoint-restart mechanism seems to be the most intuitive tool for meeting these requirements. However, because widely available, production-grade checkpoint-restart tools are still lacking, higher-level checkpoint-restart services are not yet well developed. One of the goals of the CoreGRID Network of Excellence is to define a high-level checkpoint-restart Grid Service and to position it among other Grid Services. We aim to define both the abstract model of that service and the lower-layer interface that will allow the service to cooperate with diverse existing and future checkpoint-restart tools. This paper is the first step toward that goal. It outlines the architecture of the proposed service and its connection to the actual checkpoint-restart tools.
Similar resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
An Architecture for Checkpointing and Migration of Distributed Components on the Grid
Sriram Krishnan. A computational Grid is a set of hardware and software resources that provide seamless, dependable, and pervasive access to high-end computational capabilities. The Grid differs from other computational resources such as traditional supercomputers and clusters by the following key features: (1)...
The Architecture of the XtreemOS Grid Checkpointing Service
The EU-funded XtreemOS project implements a grid operating system (OS) transparently exploiting distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) for implementing migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in a distributed heterogeneous...
Integrated Process Management in a Grid Checkpointing Environment
For many businesses, the ability to manage dynamic distributed environments has become a key success factor. Joint industry and/or academic cooperations exploit resources spanning multiple administrative domains with millions of nodes and thousands of users. In order to run the overall business effectively, Grid technologies can be applied. The EU-funded XtreemOS project implements a grid operat...
DEE: A Distributed Fault Tolerant Workflow Enactment Engine for Grid Computing
Designing and implementing a workflow management system that supports scalable execution of large-scale scientific workflows in distributed and unstable Grid environments is a large and complex task. In this paper we describe the Distributed workflow Enactment Engine (DEE) of the ASKALON application development environment for Grid computing. DEE proposes a decentralized architecture that simp...